© Guven

Word Embeddings¶

Introduction¶

Word embedding is a collective term for unsupervised ML models that learn to map a set of words $w$ in a vocabulary (or phrases, stems, lemmas) to vectors of numerical values. This approach reduces the number of dimensions from the number of unique words $V$ (i.e. the vocabulary size) to a much lower value $N$, where the dimensions are shared by all words, so that the vectors $w^{\prime}$ are no longer orthogonal. In addition, because of the way embeddings are computed, the ML models discover patterns in the relations between words (such as within a given context).

Given a word vocabulary $W=\{w_i\}$ with $|W|=V$, each word is represented by a unit vector (e.g. one-hot encoding in ML) such as

$w_1=\begin{bmatrix}1 \\ 0 \\ \vdots \\ 0\end{bmatrix}$, $w_2=\begin{bmatrix}0 \\ 1 \\ \vdots \\ 0\end{bmatrix}$, $w_V=\begin{bmatrix}0 \\ 0 \\ \vdots \\ 1\end{bmatrix}$, $W\in \mathbb{R}^{V \times V}$

Then the word embedding maps this representation $w_i$, $i=1,2,\dots,V$, to another representation (through Skip-gram or CBOW modeling) in which the vectors have a much smaller dimension $N$, with $N \ll V$: vectors $w^{\prime}_i$, $i=1,2,\dots,V$, each of length $N$, so that $W^{\prime}\in \mathbb{R}^{N \times V}$
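As a minimal sketch of this mapping (the toy sizes $V=5$, $N=3$ and the random matrix are illustrative assumptions, not values from the text), embedding a one-hot word vector amounts to selecting a column of the $N \times V$ matrix $W^{\prime}$:

```python
import numpy as np

V, N = 5, 3  # toy vocabulary size and embedding dimension (illustrative)

# One-hot representation: each word is a column of the V x V identity matrix
W = np.eye(V)

# An embedding matrix W' of shape N x V (random here; learned in practice)
rng = np.random.default_rng(0)
W_prime = rng.normal(size=(N, V))

# Embedding a word = multiplying its one-hot vector by W',
# which simply selects the corresponding column of W'
w2_onehot = W[:, 1]                # one-hot vector of the 2nd word
w2_embedded = W_prime @ w2_onehot  # equals W_prime[:, 1]

print(np.allclose(w2_embedded, W_prime[:, 1]))  # → True
```

This is why the embedding layer of Skip-gram/CBOW networks is implemented as a lookup: the matrix product with a one-hot vector is just a column selection.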

Both the Skip-Gram and the Continuous Bag of Words (CBOW) models use a neural network architecture to model the word mapping from the original $W$ to the embedding $W^{\prime}$.

A word embedding places each word in a vector space $W^{\prime}$ such that word vectors that are close to one another correspond to words that are related to one another.

Recall that the above representation of words is also used in the Tf-Idf feature matrix, where each column is a word and each row is a document or a sentence.

Note: Compare

  1. bag of words: similar meanings
  2. skip-gram: not necessarily similar meaning, but proximity (nearby words)

N-Gram¶

An n-gram is a contiguous sequence of $n$ items from a given sample of text. The items can be letters, syllables, or words, depending on the application. N-grams are typically collected from a text or speech corpus. When the items are words, n-grams are also called shingles.
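A minimal sketch of word-level n-gram extraction (the `ngrams` helper is hypothetical, written for illustration, not a specific library API):

```python
def ngrams(items, n):
    """Return the contiguous n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "the man who passes the sentence".split()
print(ngrams(words, 2))
# [('the', 'man'), ('man', 'who'), ('who', 'passes'), ('passes', 'the'), ('the', 'sentence')]
```

The same function works on any sequence, so passing a string instead of a word list yields character-level n-grams.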

Text similarity:

  • k-shingling as a similarity measure, counting k-shingles from documents
  • Jaccard similarity, $J(set_1, set_2)=\dfrac{|set_1 \cap set_2|}{|set_1 \cup set_2|}$
  • MinHash, a fast approximation to the Jaccard similarity
  • Locality sensitive hashing
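The first two bullets can be sketched as follows; the `k_shingles` and `jaccard` helpers are hypothetical illustrations (character-level shingles with $k=3$), not a specific library API:

```python
def k_shingles(text, k):
    """Character-level k-shingles of a string, as a set."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(s1, s2):
    """Jaccard similarity |s1 & s2| / |s1 | s2| of two sets."""
    return len(s1 & s2) / len(s1 | s2)

a = k_shingles("word embedding", 3)
b = k_shingles("word embeddings", 3)
print(f'{jaccard(a, b):.3f}')  # → 0.923
```

MinHash replaces the exact set intersection with hash signatures, giving a fast approximation of this same quantity for large document collections.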

Context¶

Define the context as a symmetric window centered on the target word $w_t$, containing the surrounding tokens at a distance of at most the window size $\texttt{ws}$: $C_t = \{w_k \mid k \in [t-\texttt{ws}, t+\texttt{ws}],\ k \neq t\}$

Skip-Gram¶

The Skip-gram model predicts the context words within a specific window given the current word. The input layer of the neural network uses the current word and the output layer uses the context words. The hidden layer has $N$ nodes, matching the number of embedding dimensions.

Skip-gram learns by predicting the context for a given target word, maximizing $\prod\limits_{t=1}^{T}P(C_t|w_t)$.
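In practice this product is maximized in log form; assuming the context words are conditionally independent given the target, the objective decomposes into a sum over individual context words:

```latex
\log \prod_{t=1}^{T} P(C_t \mid w_t)
  = \sum_{t=1}^{T} \log P(C_t \mid w_t)
  = \sum_{t=1}^{T} \sum_{w_c \in C_t} \log P(w_c \mid w_t)
```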

Example¶

"The man who passes the sentence should swing the sword." Ned Stark

Sliding window of 5 words, i.e. $\texttt{ws}=2$ words on each side of the target

| Window | Target | Context |
| --- | --- | --- |
| [The, man, who] | the | man, who |
| [The, man, who, passes] | man | the, who, passes |
| [The, man, who, passes, the] | who | the, man, passes, the |
| [man, who, passes, the, sentence] | passes | man, who, the, sentence |
| [sentence, should, swing, the, sword] | swing | sentence, should, the, sword |
| [should, swing, the, sword] | the | should, swing, sword |
| [swing, the, sword] | sword | swing, the |
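The (target, context) pairs above can be reproduced with a short sketch; the `skipgram_pairs` helper is hypothetical, and it uses 2 words on each side of the target (a total window of 5 tokens, as in the table):

```python
def skipgram_pairs(words, ws):
    """(target, context) pairs with a symmetric window of ws words per side."""
    pairs = []
    for t, target in enumerate(words):
        # Window is truncated at the sentence boundaries
        context = words[max(0, t - ws):t] + words[t + 1:t + 1 + ws]
        pairs.append((target, context))
    return pairs

sentence = "The man who passes the sentence should swing the sword".lower().split()
for target, context in skipgram_pairs(sentence, 2)[:3]:
    print(target, '->', context)
# the -> ['man', 'who']
# man -> ['the', 'who', 'passes']
# who -> ['the', 'man', 'passes', 'the']
```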

CBOW¶

The Continuous Bag of Words model predicts the current word given the context words within a specific window. The input layer uses the context words and the output layer uses the current word. The hidden layer has length $N$. CBOW is the mirror image of Skip-gram.

The CBOW model tries to predict the target word given its context, maximizing the likelihood $\prod\limits_{t=1}^{T}P(w_t|C_t)$

To model the Skip-gram or CBOW probabilities, a Softmax activation is applied on top of the inner product between a target vector $\texttt{u}_{w_t}$ and its context vector $\frac{1}{|C_t|}\sum_{w \in C_t}\texttt{v}_w$
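Concretely, writing $\bar{v}_t$ for the averaged context vector, the CBOW probability takes the Softmax form (the Skip-gram case swaps the roles of target and context):

```latex
\bar{v}_t = \frac{1}{|C_t|} \sum_{w \in C_t} \texttt{v}_w,
\qquad
P(w_t \mid C_t) = \frac{\exp\!\left(\texttt{u}_{w_t}^{\top} \bar{v}_t\right)}
                       {\sum_{w^{\prime} \in W} \exp\!\left(\texttt{u}_{w^{\prime}}^{\top} \bar{v}_t\right)}
```

The denominator sums over the whole vocabulary, which is why practical implementations approximate it with hierarchical softmax or negative sampling.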

Disadvantage: A limitation of word embeddings is that the possible multiple meanings of a word are conflated into a single representation (unlike WordNet, where this knowledge is carried in the graph)

Disadvantages of WordNet¶

  • Missing nuances. WordNet is not well suited to modeling the subtle differences between two entities; in practice, defining nuances is subjective. An example nuance: the words want and need have similar meanings, but need is more assertive.
  • WordNet is itself subjective, as it was designed by a relatively small community
  • Maintaining WordNet is labor-intensive (such as adding new synsets, definitions, lemmas, etc.)
  • Developing WordNet for other languages is costly

Solution: Develop a data-driven WordNet, for example one based on word embeddings.

Shortcomings of Bag of Words Method¶

  • It ignores the order of the words, e.g., this is bad = bad is this
  • It ignores the context of words. Example: "He loved books. Education is best found in books." Two vectors, one for "He loved books" and another for "Education is best found in books.", would be created, and a direct approach could treat these two vectors as orthogonal (i.e. independent), losing the relation and the context.

NLP WORKFLOW¶

  • Input: Text documents
  • Text pre-processing, sentence segmentation
  • Text parsing, word tokenization, exploratory data analysis
  • Text representation, feature engineering $\longleftarrow \fbox{word embeddings}$
  • Modeling, pattern mining $\longleftarrow \fbox{word embeddings}$
  • Knowledge representation, ontology $\longleftarrow \fbox{word embeddings}$
  • Output: Evaluation, knowledgebase, deployment

Word2Vec¶

"You shall know a word by the company it keeps." J.R. Firth

Word2vec techniques use the context of a given word to learn its semantics; that is, Word2vec learns numerical representations of words by looking at the words surrounding a given word.

Imagine encountering the following sentence in an exam: "Mary is a very stubborn child. Her pervicacious nature always gets her in trouble." What does pervicacious mean? The words surrounding the word of interest are important. In our example, pervicacious is surrounded by stubborn, nature, and trouble. These three words are enough to determine that pervicacious in fact means a state of being stubborn.

Gensim¶

Gensim (= "Generate Similar") is a topic modeling library that implements latent semantic methods, and it is licensed under the GNU LGPLv2.1 license.

Word2Vec module of gensim can generate CBOW and skip-gram models. Here is the API.

Note that Word2Vec does not remove stop words, because the algorithm relies on the broader context of the sentence in order to produce high-quality word vectors.

Doc2Vec¶

Word2vec processes text by vectorizing words, generating feature vectors that represent the words in the corpus. Similarly, Doc2Vec processes an entire variable-length document, vectorizing documents into fixed-length feature vectors. Doc2Vec uses a scheme similar to Word2Vec, extending the skip-gram or CBOW model with an additional document/paragraph vector $D$. During the training of the words, as in Word2Vec, the document vector $D$ is also trained, and thus it represents the document. Here is the API.


Conda Environments¶

A conda environment is a directory that contains a specific collection of installed conda packages. For example, one environment with NumPy 1.7 and its dependencies and another environment with NumPy 1.6 for legacy testing can exist on the same computing platform. When one environment is updated, the others are not affected. We can easily activate or deactivate environments to switch between them.

Before running the Jupyter notebook we have to activate the environment. In the following example we use gensim as the environment name, activate it, and then install the gensim library in that environment.

  1. Creating an environment named gensim: conda create -n gensim
  2. Activate the gensim environment: conda activate gensim
  3. Installing gensim: conda install gensim
  4. Install more: conda install -c conda-forge python-levenshtein
  5. Reinstall jupyter notebook: conda install jupyter
  6. Reinstall nltk: conda install nltk and thereafter every necessary library
  7. Then run the jupyter notebook

Deactivating the environment: conda deactivate
Deleting the environment: conda remove -n gensim --all


Let's load nltk dataset named abc and use gensim to generate word embeddings with CBOW approach.

In [1]:
%%time

import nltk
import gensim
print(f'gensim version= {gensim.__version__}')
from gensim.models import Word2Vec

from nltk.corpus import abc

sents = list(abc.sents())

model = Word2Vec(sents, min_count=2, workers=4)
X = list(model.wv.index_to_key)

# Sanity
print(f'ABC dataset has {len(sents)} sentences')
print(f'gensim model vocabulary has {len(X)} words mapped to N= {model.vector_size} dimensions')
gensim version= 4.3.0
ABC dataset has 29059 sentences
gensim model vocabulary has 19484 words mapped to N= 100 dimensions
CPU times: total: 11 s
Wall time: 8.04 s
In [2]:
# The closest words to the word 'science'
science = model.wv.most_similar('science')
print(science)
[('agriculture', 0.962886393070221), ('Coalition', 0.9452241063117981), ('law', 0.943227231502533), ('management', 0.9409330487251282), ('textile', 0.9401405453681946), ('biosecurity', 0.9397327899932861), ('general', 0.9383535981178284), ('descend', 0.9369604587554932), ('bulk', 0.936199963092804), ('education', 0.9359628558158875)]
In [3]:
# Distance between computer and science
science12 = model.wv.similarity('science', 'computer')
print(science12)
0.7867609

Let's see another example from Shakespeare's play Hamlet using CBOW and skip-gram methods, respectively.

In [4]:
from nltk.corpus import gutenberg

sents = list(gutenberg.sents('shakespeare-hamlet.txt'))
print(sents[0])
['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']']
In [5]:
%%time

# CBOW model
model1 = Word2Vec(sents, vector_size=200, sg=0, window=13, min_count=1, epochs=20, workers=4)

# Skip-gram model
model2 = Word2Vec(sents, vector_size=200, sg=1, window=13, min_count=1, epochs=20, workers=4)
CPU times: total: 5.98 s
Wall time: 1.92 s

Let's find out what the model tells us when the context is the names: ['Hamlet', 'Ophelia', 'Ghost']

In [6]:
similarities1b = model1.wv.most_similar(positive=['Hamlet'], topn=20)
similarities1 = model1.wv.most_similar(positive=['Hamlet', 'Ophelia', 'Ghost'], topn=20)
similarities2 = model2.wv.most_similar(positive=['Hamlet', 'Ophelia', 'Ghost'], topn=20)
In [7]:
# Clean the stop words   
def filter_words(_sim):
    from nltk.corpus import stopwords
    import re
    stop_words = set(stopwords.words('english'))
    return [(w,p) for w,p in _sim if w.lower() not in stop_words and re.search(r'^[a-zA-Z]{3,}$',w) != None]

similarities1b = filter_words(similarities1b)
similarities1 = filter_words(similarities1)
similarities2 = filter_words(similarities2)
In [8]:
for (w1,s1),(w1b,s1b),(w2,s2) in zip(similarities1, similarities1b, similarities2):
    print(f'{w1:16s}{s1:.3f}\t\t{w1b:16s}{s1b:.3f}\t\t{w2:16s}{s2:.3f}')
Horatio         0.993		Ghost           0.992		Rosincrane      0.908
Manet           0.991		Manet           0.986		Manet           0.852
Reynoldo        0.988		Horatio         0.986		Claudius        0.845
goodnight       0.988		Ophelia         0.985		Voltumand       0.842
shout           0.988		goodnight       0.979		Sister          0.840
Thankes         0.987		shout           0.978		Attendant       0.835
twaine          0.987		Noise           0.978		Queene          0.828
Voltemand       0.986		Rosincrane      0.978		Marcellus       0.818
Rosincrane      0.986		Saylor          0.977		Guildenstern    0.818
afarre          0.985		Reynoldo        0.976		bloody          0.818
vnknowne        0.985		Voltemand       0.976		Polonius        0.811
standing        0.984		twaine          0.976		Laertes         0.810
Noise           0.984		Thankes         0.975		Welcome         0.808
Lights          0.984		sickly          0.975		Osricke         0.807
Osricke         0.984		afarre          0.974		Drumme          0.807
Goodnight       0.984		Laertes         0.974		Gertrude        0.806
Ruine           0.984		standing        0.974		Coffin          0.802
Barnardo        0.983		vnknowne        0.973		Farewell        0.800
In [9]:
# The word embedding matrix
words1a = [w for w,s in similarities1] + ['Hamlet']
X1a = model1.wv[words1a]

words2a = [w for w,s in similarities2] + ['Hamlet']
X2a = model2.wv[words2a]

# Sanity
print(X1a.shape)
(20, 200)

Visualizing the Word Embedding¶

We can use classical projection methods to reduce the high-dimensional word vectors to two-dimensional plots using PCA. The visualizations can provide a qualitative diagnostic for our learned model.

Let's train a projection method on the vectors.

In [10]:
from sklearn.decomposition import PCA

pca_model1a = PCA(n_components=2).fit_transform(X1a)

pca_model2a = PCA(n_components=2).fit_transform(X2a)
In [11]:
%matplotlib inline
import matplotlib.pyplot as plt

def plot_pca(_pca_model, _words, _title):
    plt.scatter(_pca_model[:, 0], _pca_model[:, 1])
    for i, word in enumerate(_words):
        plt.annotate(word, xy=(_pca_model[i, 0], _pca_model[i, 1]), c=('r' if word=='Hamlet' else 'k'))
    plt.title(_title)

plt.figure(figsize=(12, 6), dpi=72)
    
ax=plt.subplot(1, 2, 1)
plot_pca(pca_model1a, words1a, 'CBOW Model')

ax=plt.subplot(1, 2, 2)
plot_pca(pca_model2a, words2a, 'Skip-gram Model')

plt.show()
In [12]:
dissimilarities1 = model1.wv.most_similar(negative=['Hamlet'], topn=20)
dissimilarities2 = model2.wv.most_similar(negative=['Hamlet'], topn=20)

words1b = ['Hamlet'] + [w for w,s in similarities1] + [w for w,s in dissimilarities1]

X1b = model1.wv[words1b]

words2b = ['Hamlet'] + [w for w,s in similarities2] + [w for w,s in dissimilarities2]

X2b = model2.wv[words2b]

pca_model1b = PCA(n_components=2).fit_transform(X1b)
pca_model2b = PCA(n_components=2).fit_transform(X2b)
In [13]:
plt.figure(figsize=(20, 10), dpi=72)

ax=plt.subplot(1, 2, 1)
plot_pca(pca_model1b, words1b, 'CBOW Model')

ax=plt.subplot(1, 2, 2)
plot_pca(pca_model2b, words2b, 'Skip-gram Model')

plt.show()
In [14]:
print(model1.wv.most_similar(positive=['Alas', 'poor', 'Yorick', 'Horatio'], topn=20))
[('sweet', 0.995482325553894), ('Gertrude', 0.9953031539916992), ('Sister', 0.9948393106460571), ('Roughly', 0.9948161840438843), ('false', 0.9947570562362671), ('O', 0.99473637342453), ('Oh', 0.9947359561920166), ('yong', 0.994587242603302), ('Are', 0.9944217205047607), ('ioyes', 0.9943819642066956), ('cheerefully', 0.9943240284919739), ('!', 0.9942605495452881), ('hoa', 0.9941666722297668), ('awake', 0.9941198825836182), ('sore', 0.9940646290779114), ('teares', 0.9940516352653503), ('home', 0.9938029050827026), ('looke', 0.9936867356300354), ('exception', 0.9936301112174988), ('Scull', 0.9935811161994934)]
In [15]:
print(model1.wv.most_similar(negative=['Alas', 'poor', 'Yorick', 'Horatio'], topn=20))
[('Conference', 0.9336201548576355), ('Hiperion', 0.9247527122497559), ('Siluer', 0.9054037928581238), ('range', 0.8802468776702881), ('Happy', 0.8748757243156433), ('Satyre', 0.8745262622833252), ('threats', 0.8663226366043091), ('scann', 0.8452069759368896), ('Scourge', 0.8443164825439453), ('Rood', 0.8409736752510071), ('months', 0.8404901027679443), ('punish', 0.8375216126441956), ('Sphere', 0.8307065367698669), ('truster', 0.830676257610321), ('Minister', 0.7938308715820312), ('Lunacie', 0.7803285121917725), ('trifling', 0.7726242542266846), ('space', 0.7693579792976379), ('vttered', 0.7597777843475342), ('Greefes', 0.7589038014411926)]

Do you notice any meaningful words from the context ['Alas', 'poor', 'Yorick', 'Horatio'] above?

What is that 'O'?


Example News Category Classification¶

In the previous lectures we used Tf-Idf features and classified six news categories in the Reuters corpus. Now let's see how we could use Word2Vec-generated features.

With word embeddings of size $N$, given the set of vector embeddings $v_i$, $i=1,\dots,k$ from a document $d$ with $k$ words, its feature vector $\text{fv}$ can be computed in several ways:

  • Take the mean of the vectors, $d_\text{fv} = \operatorname{mean}(v_i)$
  • Take the minimum of the vectors, $d_\text{fv} = \operatorname{min}(v_i)$
  • Take the maximum of the vectors, $d_\text{fv} = \operatorname{max}(v_i)$
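The three pooling options can be sketched as follows (the `doc_features` helper and the toy sizes $k=7$, $N=4$ are illustrative assumptions); concatenating min and max pooling is one way to obtain a doubled feature dimension, as suggested in the exercises:

```python
import numpy as np

def doc_features(word_vectors, mode='mean'):
    """Pool a (k, N) array of word vectors into one N-dim document vector."""
    ops = {'mean': np.mean, 'min': np.min, 'max': np.max}
    return ops[mode](word_vectors, axis=0)

rng = np.random.default_rng(0)
vecs = rng.normal(size=(7, 4))  # toy document: k=7 words, N=4 dimensions

fv_mean = doc_features(vecs, 'mean')
fv_minmax = np.concatenate([doc_features(vecs, 'min'),
                            doc_features(vecs, 'max')])  # 2N-dim variant
print(fv_mean.shape, fv_minmax.shape)  # → (4,) (8,)
```

The mean-pooling variant is the one used in the classification cells below.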
In [16]:
# borrowed from previous lectures
from nltk.corpus import reuters
from collections import Counter
import numpy as np
import pandas as pd

Documents = [reuters.raw(fid) for fid in reuters.fileids()]

# Categories are list of lists since each news may have more than 1 category
Categories = [reuters.categories(fid) for fid in reuters.fileids()]
CategoriesList = [_ for sublist in Categories for _ in sublist]
CategoriesSet = np.unique(CategoriesList)

print(f'N documents= {len(Documents):d}, K unique categories= {len(CategoriesSet):d}')

counts = Counter(CategoriesList)
counts = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

# Build the news category list
yCategories = [_[0] for _ in counts[:5]]
yCategories += ['other']

# Sanity check, M=29K
print(f'K categories for classification= {len(yCategories):d} {yCategories}')
N documents= 10788, K unique categories= 90
K categories for classification= 6 ['earn', 'acq', 'money-fx', 'grain', 'crude', 'other']
In [17]:
# Assign a category for each news text
yCat = []
for cat in Categories:
    bFound = False
    for _ in yCategories:
        if _ in cat:
            yCat += [_]
            bFound = True
            break  # So we add only one category for a news
    if not bFound:
        yCat += ['other']
        
# Sanity check
print(f'N categories= {len(yCat):d}')
N categories= 10788
In [18]:
# Convert to numerical np.array which sklearn likes
ydocs = np.array([yCategories.index(_) for _ in yCat])
In [19]:
from nltk import word_tokenize

Sentences = [word_tokenize(doc) for doc in Documents]
In [20]:
%%time

# CBOW model
model = Word2Vec(Sentences, vector_size=300, sg=0, window=9, min_count=1, epochs=20, workers=4)
CPU times: total: 1min 5s
Wall time: 17.8 s
In [21]:
# Use the mean of word vector that makes up a sentence or a document
# Note that there are better ways to use the word vector as a feature vector - such as doc2vec in gensim
Xdocs = np.array([np.mean([model.wv[word] for word in doc], axis=0) for doc in Sentences])

print(Xdocs.shape)
(10788, 300)
In [22]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def kfold_eval_docs(_clf, _Xdocs, _ydocs):
    # Need indexable data structure
    acc = []
    kf = StratifiedKFold(n_splits=10, shuffle=False, random_state=None)
    for train_index, test_index in kf.split(_Xdocs, _ydocs):
        _clf.fit(_Xdocs[train_index], _ydocs[train_index])
        y_pred = _clf.predict(_Xdocs[test_index])
        acc += [accuracy_score(_ydocs[test_index], y_pred)]
    return np.array(acc)
In [23]:
%%time

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
acc = kfold_eval_docs(nb, Xdocs, ydocs)

print(f'Naive Bayes CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
Naive Bayes CV accuracy= 0.693 ±0.032
CPU times: total: 328 ms
Wall time: 306 ms
In [24]:
%%time

from sklearn.ensemble import RandomForestClassifier

n_cores = 8
rf = RandomForestClassifier(n_jobs=n_cores, n_estimators=300, max_depth=10, random_state=None, class_weight='balanced')
acc = kfold_eval_docs(rf, Xdocs, ydocs)

print(f'Random Forest CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
Random Forest CV accuracy= 0.885 ±0.015
CPU times: total: 9min 4s
Wall time: 1min 10s
In [25]:
%%time

from sklearn.svm import SVC

svm = SVC(kernel='rbf', gamma='scale', class_weight='balanced')
acc = kfold_eval_docs(svm, Xdocs, ydocs)

print(f'Support Vector Machine CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
Support Vector Machine CV accuracy= 0.889 ±0.013
CPU times: total: 26.1 s
Wall time: 26.1 s
In [26]:
%%time

import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# To avoid non-convergence one has to increase 'max_iter' parameter
lr = LogisticRegression(solver='sag', multi_class='auto', max_iter=500, class_weight='balanced')
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning)
    acc = kfold_eval_docs(lr, Xdocs, ydocs)

print(f'Logistic Regression CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
Logistic Regression CV accuracy= 0.899 ±0.009
CPU times: total: 2min 20s
Wall time: 2min 8s

Notice In this previously seen problem, the classification performance is almost as good as in the previous lectures, and it runs much faster since the vector size is only 300 (the original M was 29016).


Example News Category Classification using Doc2Vec¶

In the previous cells we used Word2Vec features and classified six news categories in the Reuters corpus. Now let's see how we could use Doc2Vec-generated features.

In [27]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Doc2Vec expects TaggedDocument input data structure, every document is a list of words and tagged with int ID
DocumentsTagged = [TaggedDocument(word_tokenize(reuters.raw(fid)), [i]) for i, fid in enumerate(reuters.fileids())]

model2 = Doc2Vec(DocumentsTagged, vector_size=100, window=9, min_count=1, epochs=20, workers=4)
In [28]:
# Build X from Doc2Vec document vectors
Xdocs2 = np.array([model2.dv[_.tags[0]] for _ in DocumentsTagged])

print(Xdocs2.shape)
(10788, 100)
In [29]:
rf = RandomForestClassifier(n_jobs=n_cores, n_estimators=300, max_depth=10, random_state=None, class_weight='balanced')
acc = kfold_eval_docs(rf, Xdocs2, ydocs)
print(f'Random Forest CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')

acc = kfold_eval_docs(svm, Xdocs2, ydocs)
print(f'Support Vector Machine CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
Random Forest CV accuracy= 0.730 ±0.015
Support Vector Machine CV accuracy= 0.774 ±0.018
In [30]:
# Try another model
model3 = Doc2Vec(DocumentsTagged, vector_size=100, dm=0, window=9, min_count=1, epochs=20, workers=4)

Xdocs3 = np.array([model3.dv[_.tags[0]] for _ in DocumentsTagged])
print(Xdocs3.shape)

rf = RandomForestClassifier(n_jobs=n_cores, n_estimators=300, max_depth=10, random_state=None, class_weight='balanced')
acc = kfold_eval_docs(rf, Xdocs3, ydocs)
print(f'Random Forest CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')

acc = kfold_eval_docs(svm, Xdocs3, ydocs)
print(f'Support Vector Machine CV accuracy= {np.mean(acc):.3f} {chr(177)}{np.std(acc):.3f}')
(10788, 100)
Random Forest CV accuracy= 0.882 ±0.012
Support Vector Machine CV accuracy= 0.911 ±0.012

Notice In this approach the classification performance is almost as good as the previous results (or better), and it runs much faster since the vector size is only 100 (the original M was 29016).


References¶

  1. Almeida, Felipe, and Geraldo Xexéo. "Word embeddings: A survey." arXiv preprint arXiv:1901.09069 (2019).

Exercises¶

Exercise 1. Many different approaches can be built on the Word2Vec-generated word vectors and document vectors, such as taking the minimum and maximum vector values (magnitude-wise), doubling the dimension of the feature vectors (from $M$ to $2M$), etc.

  • Research and find the difference (perhaps subjective) between applications of CBOW and Skip-gram
    • Attempt improving classifier performance using such approaches
  • Attempt increasing the word embeddings size to 1000. Perhaps reducing the dimensions is more meaningful?
  • Map the terms 'derived' and 'enriched' to the cases (1.) when using generated Word2Vec in this dataset as opposed to (2.) using GloVe word embeddings.
  • Research Top2vec and compare to Doc2vec

In [31]:
%%html
<style>
    table {margin-left: 0 !important;}
</style>
<!-- Display markdown tables left oriented in this notebook. -->